Appendix A — cut()

For numeric variables in a data frame, it can sometimes be useful to split the values into intervals and create a new factor whose levels are those intervals. For example, we might want to identify high, mid and low levels of hdl in chol_full.

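The function that can do this in R is cut(). The main arguments that cut() takes are x, the numeric vector to split; breaks, either a single number giving how many intervals to create or a vector of cut points; labels, optional names for the resulting levels; include.lowest, which controls whether a value equal to the lowest break is included in the first interval; and right, which controls whether the intervals are closed on the right (the default) or on the left.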

We can split hdl into three levels using the following code.

cut(x = chol_full$hdl, breaks = 3, include.lowest = TRUE)
 [1] [24.9,55.7] [24.9,55.7] (55.7,86.3] [24.9,55.7] (86.3,117]  [24.9,55.7]
 [7] (55.7,86.3] (55.7,86.3] [24.9,55.7] [24.9,55.7] (55.7,86.3] (86.3,117] 
[13] [24.9,55.7] (55.7,86.3]
Levels: [24.9,55.7] (55.7,86.3] (86.3,117]

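This tells us that the lowest level is the range [24.9, 55.7], the middle level is (55.7, 86.3] and the highest level is (86.3, 117]. A square bracket means the endpoint is included in the interval and a round bracket means it is excluded, so a value of exactly 55.7 would fall in the lowest level. Because breaks = 3 is a single number, cut() splits the range of hdl into three intervals of equal width. We can also see which level each row falls into: the first two rows are in the low level for hdl, the third row is in the middle level, and so on.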
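
As a minimal sketch of how values are assigned to intervals, the example below repeats the split on a small vector. The vector hdl_sample is an assumption made for illustration: it holds only the six hdl values visible in the head(chol_full) output later in this appendix, not the full column. It contains 25 and 117, which appear to be the smallest and largest hdl values, so the break points should come out the same.

```r
# Assumption: only the six hdl values shown by head(chol_full), not the full column
hdl_sample <- c(25, 36, 65, 37, 117, 51)

# Same call as in the text, applied to the sample vector
hdl_cut <- cut(x = hdl_sample, breaks = 3, include.lowest = TRUE)

# The interval labels that cut() generated
print(levels(hdl_cut))

# Count how many values fall into each interval
print(table(hdl_cut))
```

Tabulating the result with table() is a quick way to check how the values are spread across the intervals and whether any interval is empty.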

We can then add this as a new factor variable, hdl_level, and represent each level with the labels "low", "mid" and "high" using the code below.

chol_full$hdl_level <- factor(cut(x = chol_full$hdl, breaks = 3, include.lowest = TRUE),
                              labels = c("low", "mid", "high"))

head(chol_full)
    id ldl hdl trig age gender     smoke weight height      bmi hdl_level
1 P912 175  25  148  39 female        no  90.77   1.69 31.78110       low
2 P215 196  36   92  32 female        no  75.06   1.75 24.50939       low
3 P063 139  65   NA  42   male      <NA>  73.99   1.84 21.85432       mid
4 P117 162  37  139  30 female ex-smoker  86.25   1.83 25.75473       low
5 P613 140 117   59  42 female ex-smoker  76.95   1.81 23.48829      high
6 P332 147  51  126  65 female ex-smoker  57.66   1.75 18.82776       low
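
The same labelling can also be sketched more directly with the labels argument of cut(), which avoids the outer call to factor(). This is a sketch under stated assumptions: hdl_sample holds only the six hdl values shown in the head() output above, and the cut points 40 and 60 in the second call are hypothetical values chosen for illustration, not clinical thresholds.

```r
# Assumption: the six hdl values shown in the head(chol_full) output
hdl_sample <- c(25, 36, 65, 37, 117, 51)

# Pass the labels straight to cut() instead of wrapping it in factor()
hdl_level <- cut(x = hdl_sample, breaks = 3,
                 labels = c("low", "mid", "high"),
                 include.lowest = TRUE)
print(hdl_level)

# breaks can also be a vector of cut points; 40 and 60 are hypothetical
hdl_fixed <- cut(x = hdl_sample, breaks = c(0, 40, 60, Inf),
                 labels = c("low", "mid", "high"))
print(hdl_fixed)
```

With breaks = 3 the intervals depend on the minimum and maximum of the data, so they change if the data change. With a vector of cut points the intervals stay fixed, which may be preferable when the levels need to mean the same thing across data sets.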